We address the task of open-world class-agnostic object detection, i.e., detecting every object in an image by learning from a limited number of base object classes. State-of-the-art RGB-based models suffer from overfitting the training classes and often fail at detecting novel-looking objects. This is because RGB-based models primarily rely on appearance similarity to detect novel objects and are also prone to overfitting short-cut cues such as textures and discriminative parts. To address these shortcomings of RGB-based object detectors, we propose incorporating geometric cues such as depth and normals, predicted by general-purpose monocular estimators. Specifically, we use the geometric cues to train an object proposal network for pseudo-labeling unannotated novel objects in the training set. Our resulting Geometry-guided Open-world Object Detector (GOOD) significantly improves detection recall for novel object categories and already performs well with only a few training classes. Using a single "person" class for training on the COCO dataset, GOOD surpasses SOTA methods by 5.0% AR@100, a relative improvement of 24%.
translated by 谷歌翻译
The 1$^{\text{st}}$ Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicle (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. This report summarizes the main findings of the individual subchallenges and introduces a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. The datasets, evaluation code and the leaderboard are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi.
translated by 谷歌翻译
不确定性估计是任何部署的机器学习系统中的关键组件。评估不确定性估计的一种方法是使用“分布外”(OOD)检测,即使用不确定性区分训练数据分布和看不见的不同数据分布。在这项工作中,我们表明当前特征密度的不确定度估计器不能在不同的OOD检测设置上保持良好。为了解决这一问题,我们建议分解学习的陈述,并将估计的不确定性分别相结合。通过实验,我们证明我们可以大大提高不确定性估计的性能和可解释性。
translated by 谷歌翻译
目前,这是一个热门的研究主题,可以在深度学习和物联网技术的帮助下实现大量光谱数据的准确,高效和实时识别。深度神经网络在光谱分析中起着关键作用。但是,更深层模型的推断是以静态方式进行的,不能根据设备进行调整。并非所有样本都需要分配所有计算以实现自信的预测,这阻碍了最大化整体性能。为了解决上述问题,我们提出了一个具有自适应推理的光谱数据分类框架。具体而言,要为不同样本分配不同的计算,同时更好地利用不同设备之间的协作,我们利用早期外观体系结构,将中间分类器放置在架构的不同深度,并在预测置信度达到预设阈值时输出结果。我们提出了一个自我介绍学习的训练范式,最深的分类器对浅的分类器进行了软监督,以最大程度地提高其性能和训练速度。同时,为了减轻早期外观范式中中间分类器的位置和数字设置的性能脆弱性,我们提出了一个自适应的残留网络。它可以调整不同曲线位置下每个块中的层数,因此它可以专注于曲线的重要位置(例如:拉曼峰),并根据任务性能和计算资源准确地分配适当的计算预算。据我们所知,本文是首次尝试通过自适应推断物联网平台下的光谱检测来进行优化。我们进行了许多实验,实验结果表明,我们所提出的方法可以比现有方法实现更高的计算预算性能。
translated by 谷歌翻译
虚拟面部化身将在身临其境的沟通,游戏和元视频中发挥越来越重要的作用,因此至关重要的是包容性。这需要准确地恢复出现,无论年龄,性别或种族如何,都以反照率表示。尽管在估计3D面部几何形状方面取得了重大进展,但反照率估计受到较少的关注。该任务在根本上是模棱两可的,因为观察到的颜色是反照率和照明的函数,这两者都是未知的。我们发现,由于(1)偏爱较轻的色素沉着和(2)算法溶液,因此当前的方法偏向浅色肤色,而无视光/反照率的歧义。为了解决这个问题,我们提出了一个新的评估数据集(公平)和算法(Trust),以改善反照率估计以及公平性。具体而言,我们创建了第一个面部反照率评估基准,其中受试者在肤色方面保持平衡,并使用单个类型学角度(ITA)度量测量精度。然后,我们通过建立关键观察结果来解决光/反照率的歧义:与面部的裁剪图像相反,整个场景的图像包含有关照明的重要信息,可用于歧义。信任通过在面部区域和从场景图像中获得的全球照明信号进行调节来回归面部反照率。我们的实验结果表明,就准确性和公平性而言,与最先进的反照率估计方法相比,相比之下。评估基准和代码将用于研究目的,网址为https://trust.is.tue.mpg.de。
translated by 谷歌翻译
注意机制使图形神经网络(GNN)能够学习目标节点与其单跳邻居之间的注意力权重,从而进一步提高性能。但是,大多数现有的GNN都针对均匀图,其中每一层只能汇总单跳邻居的信息。堆叠多层网络引入了相当大的噪音,并且很容易导致过度平滑。我们在这里提出了一种多跃波异质邻域信息融合图表示方法(MHNF)。具体而言,我们提出了一个混合元自动提取模型,以有效提取多ihop混合邻居。然后,我们制定了一个跳级的异质信息聚合模型,该模型在同一混合Metapath中选择性地汇总了不同的跳跃邻域信息。最后,构建了分层语义注意融合模型(HSAF),该模型可以有效地整合不同的互动和不同的路径邻域信息。以这种方式,本文解决了汇总MultiHop邻里信息和学习目标任务的混合元数据的问题。这减轻了手动指定Metapaths的限制。此外,HSAF可以提取Metapaths的内部节点信息,并更好地整合存在不同级别的语义信息。真实数据集的实验结果表明,MHNF在最先进的基准中取得了最佳或竞争性能,仅1/10〜1/100参数和计算预算。我们的代码可在https://github.com/phd-lanyu/mhnf上公开获取。
translated by 谷歌翻译
This paper focuses on designing efficient models with low parameters and FLOPs for dense predictions. Even though CNN-based lightweight methods have achieved stunning results after years of research, trading-off model accuracy and constrained resources still need further improvements. This work rethinks the essential unity of efficient Inverted Residual Block in MobileNetv2 and effective Transformer in ViT, inductively abstracting a general concept of Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance though sharing the same framework. Motivated by this phenomenon, we deduce a simple yet efficient modern \textbf{I}nverted \textbf{R}esidual \textbf{M}obile \textbf{B}lock (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependency and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase \textbf{E}fficient \textbf{MO}del (EMO) based only on a series of iRMBs for dense applications. Massive experiments on ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, \eg, our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 that surpass \textbf{SoTA} CNN-/Transformer-based models, while trading-off the model accuracy and efficiency well.
translated by 谷歌翻译
Supervised Question Answering systems (QA systems) rely on domain-specific human-labeled data for training. Unsupervised QA systems generate their own question-answer training pairs, typically using secondary knowledge sources to achieve this outcome. Our approach (called PIE-QG) uses Open Information Extraction (OpenIE) to generate synthetic training questions from paraphrased passages and uses the question-answer pairs as training data for a language model for a state-of-the-art QA system based on BERT. Triples in the form of <subject, predicate, object> are extracted from each passage, and questions are formed with subjects (or objects) and predicates while objects (or subjects) are considered as answers. Experimenting on five extractive QA datasets demonstrates that our technique achieves on-par performance with existing state-of-the-art QA systems with the benefit of being trained on an order of magnitude fewer documents and without any recourse to external reference data sources.
translated by 谷歌翻译
Transformer has achieved impressive successes for various computer vision tasks. However, most of existing studies require to pretrain the Transformer backbone on a large-scale labeled dataset (e.g., ImageNet) for achieving satisfactory performance, which is usually unavailable for medical images. Additionally, due to the gap between medical and natural images, the improvement generated by the ImageNet pretrained weights significantly degrades while transferring the weights to medical image processing tasks. In this paper, we propose Bootstrap Own Latent of Transformer (BOLT), a self-supervised learning approach specifically for medical image classification with the Transformer backbone. Our BOLT consists of two networks, namely online and target branches, for self-supervised representation learning. Concretely, the online network is trained to predict the target network representation of the same patch embedding tokens with a different perturbation. To maximally excavate the impact of Transformer from limited medical data, we propose an auxiliary difficulty ranking task. The Transformer is enforced to identify which branch (i.e., online/target) is processing the more difficult perturbed tokens. Overall, the Transformer endeavours itself to distill the transformation-invariant features from the perturbed tokens to simultaneously achieve difficulty measurement and maintain the consistency of self-supervised representations. The proposed BOLT is evaluated on three medical image processing tasks, i.e., skin lesion classification, knee fatigue fracture grading and diabetic retinopathy grading. The experimental results validate the superiority of our BOLT for medical image classification, compared to ImageNet pretrained weights and state-of-the-art self-supervised learning approaches.
translated by 谷歌翻译
Knowledge graph embedding (KGE), which maps entities and relations in a knowledge graph into continuous vector spaces, has achieved great success in predicting missing links in knowledge graphs. However, knowledge graphs often contain incomplete triples that are difficult to inductively infer by KGEs. To address this challenge, we resort to analogical inference and propose a novel and general self-supervised framework AnKGE to enhance KGE models with analogical inference capability. We propose an analogical object retriever that retrieves appropriate analogical objects from entity-level, relation-level, and triple-level. And in AnKGE, we train an analogy function for each level of analogical inference with the original element embedding from a well-trained KGE model as input, which outputs the analogical object embedding. In order to combine inductive inference capability from the original KGE model and analogical inference capability enhanced by AnKGE, we interpolate the analogy score with the base model score and introduce the adaptive weights in the score function for prediction. Through extensive experiments on FB15k-237 and WN18RR datasets, we show that AnKGE achieves competitive results on link prediction task and well performs analogical inference.
translated by 谷歌翻译